{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Q 表示法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在开始讨论 {func}`~tvm.tir.q_multiply_shift` 之前，先介绍下 [Q 格式计数法](https://en.wikipedia.org/wiki/Q_(number_format)) 表示定点数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q 表示法是一种指定二进制定点数格式参数的方法。例如，在 Q 表示法中，由 `Q8.8` 表示的数字格式意味着此格式中的定点数具有 8 位整数部分和 8 位小数部分。\n",
    "\n",
    "## Q 表示法的定义"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Texas Instruments 版本\n",
    "[Texas Instruments](https://en.wikipedia.org/wiki/Texas_Instruments) 定义的 Q 表示法，由字母 Q 后跟一对数字 `m.n` 组成，其中 `m` 表示用于值的整数部分的位数，`n` 表示小数位的位数。\n",
    "\n",
    "默认情况下，该表示法描述了有符号二进制定点格式，未缩放的整数值以补码格式存储，用于大多数二进制处理器中。第一位总是给出值的符号（1 = 负数，0 = 非负数），并且不计入 `m` 参数。因此，使用的总位数为 `1 + m + n`。\n",
    "\n",
    "例如，`Q3.12` 表示总位数为 16 位的有符号二进制定点数，包括符号位、三位整数部分和 12 位小数部分。也就是说，隐式乘以缩放因子 $2^{-12}$ 的 16 位有符号（补码）整数。\n",
    "\n",
    "特别是，当 `n` 为零时，数字只是整数。如果 `m` 为零，除符号位外的所有位都是小数位；则存储数字的范围是从 $[-1.0, +1.0)$。\n",
    "\n",
    "`m` 和小数点可以省略，在这种情况下，它们会根据存储值的变量或寄存器的大小的推断得出。因此，`Q12` 表示具有任意位数的有符号整数，隐式地乘以 $2^{-12}$。\n",
    "\n",
    "字母 `U` 可以添加到 `Q` 前面，表示无符号二进制定点格式。例如，`UQ1.15` 描述为隐含缩放因子为 $2^{-15}$ 的无符号 16 位整数，其范围 $[0.0, \\cfrac{2^{16}-1}{2^{15}})$\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} ARM 版本\n",
    "ARM 使用的 Q 表示法的变体中，`m` 数包括符号位。例如，16 位有符号整数在 TI 变体中表示为 `Q15.0`，但在 ARM 变体中表示为 `Q16.0`。\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Qm.n` 或者 `UQm.n` 格式的分辨率（连续值之间的差）总是 $2^{-n}$。可表示值的范围取决于所使用的符号：\n",
    "\n",
    "\n",
    "| Notation | Texas Instruments Notation | ARM Notation |\n",
    ":-:|:-:|:-:|\n",
    "| Signed Q*m*.*n* | $[-2^m, 2^m - 2^{-n}]$ | $[-2^{m-1}, 2^{m-1} - 2^{-n}]$ |\n",
    "| Unsigned UQ*m*.*n* | $[0, 2^m - 2^{-n}]$ |$[0, 2^m - 2^{-n}]$ |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\sum_{k=0}^{m-1} 2^k + \\sum_{k=1}^{n} 2^{-k} = (2^m - 1) + (1 - 2^{-n}) = 2^m - 2^{-n}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "例如，`Q15.1` 格式需要 `15+1 = 16` 位，分辨率为 $2^{−1} = 0.5$，并且可表示的值范围从 $−2^{14} = −16384.0$ 到 $+2^{14} − 2^{−1} = +16383.5$。在十六进制中，负值范围从 `0x8000` 到 `0xFFFF`，后面是非负值从 `0x0000` 到 `0x7FFF`。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Q 表示法的数学运算\n",
    "\n",
    "Q 表示法是两个整数的比率：分子保存在存储器中，分母 $d$ 等于 $2^n$。\n",
    "\n",
    "比如，在 `Q8` 中，分母是 $2^8 = 256$。对于 $1.5$ 等于 $384/256$，这是将 $1.5$ 转换为 `Q8` 数。在这种情况下，可以将 $1.5$ 视为 $15/10$,然后将其转换 为Q8 形式。首先，需要找到一个数，使得当将其除以 10 时，结果的分母是 $2$ 的 $8$ 次方。这个数就是 $15 * 2^8 = 15 * 256 = 3840$。所以，$15/10 = 3840/1000$。然后，我们再将结果除以 $2$，得到 $384/1000$。这就是 $384/256$ 的 `Q8` 表示形式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果要保持 Q 数的基数（$n$保持不变），则 Q 数数学运算必须保持分母 $d$ 不变。以下公式显示了对一般 Q 数 $N1$ 和 $N2$ 的数学运算。（如果我们考虑上述示例，则 $N1$ 为 $384$，$d$ 为 $256$。）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\cfrac{N1}{d} + \\cfrac{N2}{d} = \\cfrac{N1+N2}{d} \\\\\n",
    "\\cfrac{N1}{d} - \\cfrac{N2}{d} = \\cfrac{N1-N2}{d} \\\\\n",
    "(\\cfrac{N1}{d} \\times \\cfrac{N2}{d}) \\times d = \\cfrac{N1 \\times N2}{d} \\\\\n",
    "(\\cfrac{N1}{d} / \\cfrac{N2}{d}) / d = \\cfrac{N1/N2}{d}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于分母是 $2$ 的幂，乘法可以作为算术左移实现，除法可以作为算术右移实现；在许多处理器上，移位比乘法和除法更快。\n",
    "\n",
    "为了保持准确性，中间的乘法和除法结果必须是双精度的，并且在转换回所需的 Q 数之前必须小心处理四舍五入中间结果。\n",
    "\n",
    "两个不同 Q 格式基底的数字也可以相乘除，相乘除的数字也可以用另一个基底表示。以下是二个 Q 格式数字 $N1$(分母$d_1$) 和 $N2$(分母$d_2$) 的运算，运算结果的分母是 $d_{3}$：\n",
    "\n",
    "$$\n",
    "(\\cfrac{N1}{d_1} \\times \\cfrac{N2}{d_2}) \\times \\cfrac{d_1 d_2}{d_3} = \\cfrac{N1 \\times N2}{d_3} \\\\\n",
    "(\\cfrac{N1}{d_1} / \\cfrac{N2}{d_2}) / \\cfrac{d_1 d_2}{d_3} = \\cfrac{N1 / N2}{d_3}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "c++"
    }
   },
   "source": [
    "若用 C 语言，相同 Q 格式基底数字四则运算对应的程式如下（以下的 Q 是表示小数部分的位元数）："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Q 数加法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "int16_t q_add(int16_t a, int16_t b)\n",
    "{\n",
    "    return a + b;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有饱和："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "int16_t q_add_sat(int16_t a, int16_t b)\n",
    "{\n",
    "    int16_t result;\n",
    "    int32_t tmp;\n",
    "\n",
    "    tmp = (int32_t)a + (int32_t)b;\n",
    "    if (tmp > 0x7FFF)\n",
    "        tmp = 0x7FFF;\n",
    "    if (tmp < -1 * 0x8000)\n",
    "        tmp = -1 * 0x8000;\n",
    "    result = (int16_t)tmp;\n",
    "\n",
    "    return result;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "浮点数有 ±Inf，但 Q 格式没有，若不进行饱和处理，二个很大的正数相加，可能会变成很大的负数。若是用组合语言，可以用 Signed Overflow 旗标来避免 C 语言实现时需要的型态转换。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Q 数减法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "int16_t q_sub(int16_t a, int16_t b)\n",
    "{\n",
    "    return a - b;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Q 数乘法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "// precomputed value:\n",
    "int Q = 8;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "int K = 1 << (Q - 1);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "// saturate to range of int16_t\n",
    "int16_t sat16(int32_t x)\n",
    "{\n",
    "\tif (x > 0x7FFF) return 0x7FFF;\n",
    "\telse if (x < -0x8000) return -0x8000;\n",
    "\telse return (int16_t)x;\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "int16_t q_mul(int16_t a, int16_t b)\n",
    "{\n",
    "    int16_t result;\n",
    "    int32_t temp;\n",
    "\n",
    "    temp = (int32_t)a * (int32_t)b; // result type is operand's type\n",
    "    // Rounding; mid values are rounded up\n",
    "    temp += K;\n",
    "    // Correct by dividing by base and saturate result\n",
    "    result = sat16(temp >> Q);\n",
    "\n",
    "    return result;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Q 数除法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "int16_t q_div(int16_t a, int16_t b)\n",
    "{\n",
    "    /* pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format) */\n",
    "    int32_t temp = (int32_t)a << Q;\n",
    "    /* Rounding: mid values are rounded up (down for negative values). */\n",
    "    /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */\n",
    "    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {\n",
    "        temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */\n",
    "    } else {\n",
    "        temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */\n",
    "    }\n",
    "    return (int16_t)(temp / b);\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\t| Q15 | Q0.15 0.999969 | -1.000000 |\n",
      "\t| Q14 | Q1.14 1.999939 | -2.000000 |\n",
      "\t| Q13 | Q2.13 3.999878 | -4.000000 |\n",
      "\t| Q12 | Q3.12 7.999756 | -8.000000 |\n",
      "\t| Q11 | Q4.11 15.999512 | -16.000000 |\n",
      "\t| Q10 | Q5.10 31.999023 | -32.000000 |\n",
      "\t| Q9 | Q6.9 63.998047 | -64.000000 |\n",
      "\t| Q8 | Q7.8 127.996094 | -128.000000 |\n",
      "\t| Q7 | Q8.7 255.992188 | -256.000000 |\n",
      "\t| Q6 | Q9.6 511.984375 | -512.000000 |\n",
      "\t| Q5 | Q10.5 1023.968750 | -1024.000000 |\n",
      "\t| Q4 | Q11.4 2047.937500 | -2048.000000 |\n",
      "\t| Q3 | Q12.3 4095.875000 | -4096.000000 |\n",
      "\t| Q2 | Q13.2 8191.750000 | -8192.000000 |\n",
      "\t| Q1 | Q14.1 16383.500000 | -16384.000000 |\n",
      "\t| Q0 | Q15.0 32767.000000 | -32768.000000 |\n"
     ]
    }
   ],
   "source": [
    "int16_t q_max = 0x7fff;\n",
    "int16_t q_min = 0x8000;\n",
    "\n",
    "float f_max = 0;\n",
    "float f_min = 0;\n",
    "for (int8_t i=15; i>=0; i--) {\n",
    "    f_max = (float)q_max /pow(2, i);\n",
    "    f_min = (float)q_min /pow(2, i);\n",
    "    printf(\"\\t| Q%d | Q%d.%d %f | %f |\\r\\n\",\n",
    "           i, (15-i), i, f_max, f_min);\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "// 《Fast Inverse Square Root》\n",
    "float Q_rsqrt( float number )\n",
    "{\n",
    " long i;\n",
    " float x2, y;\n",
    " const float threehalfs = 1.5F;\n",
    "\n",
    " x2 = number * 0.5F;\n",
    " y   = number;\n",
    " i   = * ( long * ) &y;   // evil floating point bit level hacking\n",
    " i   = 0x5f3759df - ( i >> 1 ); // what the fuck?\n",
    " y   = * ( float * ) &i;\n",
    " y   = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration\n",
    " // y   = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed\n",
    "\n",
    " #ifndef Q3_VM\n",
    " #ifdef __linux__\n",
    "   assert( !isnan(y) ); // bk010122 - FPE?\n",
    " #endif\n",
    " #endif\n",
    " return y;\n",
    "}  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "C++14",
   "language": "C++14",
   "name": "xcpp14"
  },
  "language_info": {
   "codemirror_mode": "text/x-c++src",
   "file_extension": ".cpp",
   "mimetype": "text/x-c++src",
   "name": "c++",
   "version": "14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
